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ABSTRACT 

We introduce GRay, a massively parallel integrator designed to trace the trajectories of billions of photons 
in a curved spacetime. This GPU-based integrator employs the stream processing paradigm, is implemented 
in CUDA C/C++, and runs on nVidia graphics cards. The peak performance of GRay using single precision 
floating-point arithmetic on a single GPU exceeds 300 GFLOP (or 1 nanosecond per photon per time step). 
For a realistic problem, where the peak performance cannot be reached, GRay is two orders of magnitude faster 
than existing CPU-based ray tracing codes. This performance enhancement allows more effective searches 
of large parameter spaces when comparing theoretical predictions of images, spectra, and lightcurves from 
the vicinities of compact objects to observations. GRay can also perform on-the-fly ray tracing within general 
relativistic magnetohydrodynamic algorithms that simulate accretion flows around compact objects. Making 
use of this algorithm, we calculate the properties of the shadows of Kerr black holes and the photon rings that 
surround them. We also provide accurate fitting formulae of their dependencies on black hole spin and observer 
inclination, which can be used to interpret upcoming observations of the black holes at the center of the Milky 
Way, as well as M87, with the Event Horizon Telescope. 
Subject headings: numerical methods — radiative transfer — relativity 



1. INTRODUCTION 

The propagation of photons in the curved spacetimes 
around black holes and neutron stars determines the appear- 
ance of these compact objects to an observer at infinity as 
well as the thermodynamic properties of the accretion flows 
around them. This strong-field lensing imprints characteristic 
signatures of the spacetimes on the emerging radiation, which 
have been exploited in various attempts to infer the properties 
of the compact objects themselves. 

As an example, special and general relativistic effects 
broaden fluorescence lines that originate in the accretion disks 
and give them the characteristic, asymmetric and double- 
peaked profiles that have been used in inferring black hole 
spins in acti ve galactic nuclei and in galactic sources (see 
Miller! 120071 for a review). In recent years, this approach 
has provided strong evid ence for rapid spins in black h oles 
such as MCG 6-30-15 (iBrenneman & Reynolds! 120061) and 
1H 0707-495 (F abian et all 120091) and is expected to ma- 
ture even further with upcoming observations with Astro-H 
dTakahashi et al.ll2012l) . 

A similar application of strong-field lensing is encountered 
in modeling the images from the accretion flows around the 
black holes in the center of the Milky Way and M87 (e.g., 
iBroderick et alj|2009t iDexter et a!]|2009l) . In the near future, 
such lensing models will be crucial for interpreting imaging 
observatio ns of these two sourc es with the Event Horizon 
Telescope (Doeleman et al. 20091). 

Finally, strong-field lensing around a spinning neutron star 
determines the pulse profile generated from a hot spots on 
its surface. The pulsation amplitude in such a lightcurve 
depends sensitively on the compactness of the neutron star 
(Pech enTck et alj fl983f) . For this reason, comparing model 
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to observed pulsation lightcurves has led to coarse measure- 
ments of the neutron- star properties in rotation-powered (e.g., 
Bogd anpv et alj 120071) an d accretion-powe red millisecond 
pulsars dLeahv et al.ll2008l) and bursters (e.g.. IWeinberg et aT] 
1200 It iMuno et al.l 120021) This technique shapes the key sci- 
ence goals of two proposed X-ray miss ions, ESA's LOFT 
(iFerqci et al.ll2012l) and NASA's NICER (lArzoumanian et al l 
2009). 

The general ray tracing problem in a relativistic space- 
time has been addressed b y se ve ral research gro ups to 



date (e.g., ICunnin gharri 



[19751 Pechenic k et all 11983, 
Laoil fl99lt ISpeithetafl [19951 [Miller & Lambl [1998 : 
Braie & Romanill2002l iDovciak et alJl2004t iBrodericklbood: 
Cadeauetal.1 120071: IDexter & Agol l2009t IDolence et all 
120091: iPsaltis & Johannsenl 120121; feaubock et al.l 120121) fol- 
lowing two general approaches. In one approach, which is 
only applicable to the Kerr spacetime of spinning black holes, 
several integrals of motion are used to reduce the order of 
the differential equations. In the other approach, which can 
be used both in the case of black holes and neutron stars, the 
second-order geodesic equations are integrated. 

The Kerr metric is of Petrov-type D and, therefore, the 
Carter constant Q provides a third integral of motion along 
the trajectories of pho tons, making the first appr oach possi- 
ble (see discussion in Johann sen & Psal tis 2010). Introduc- 
ing a deviation from the Kerr metric, however, either in order 
to model neutron-star spacetimes, which can have different 
multipole moments, or to test the no-hair theorem of black 
holes, does not necessarily preserve its Petrov-type D and 
the Carter constant is no longer conserved along geodesies 
(actually no such Killing tensor exists in spacetimes that are 
not of Petrov-type D). As a result, ray tracing in a non-Kerr 
metric requires integrating the second-order differential equa- 
tio ns for individual geodesies . Our current algori thm based 
on IPsaltis & Johannsenl d2012l) . as well as those o flBroderickl 
(2006). ICadeau et alJ (l200'7ir and IDolence et alJ (120091) . fol- 
low the latter approach, making them applicable to a wider 
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range of astrophysical settings. 

Although the latter algorithms are not limited by assump- 
tions regarding the spacetimes of the compact objects and 
reach efficiencies of ~ 10 4 geodesic integrations per second, 
they are still not at the level of efficiency necessary for the 
applications discussed earlier. For example, in order to sim- 
ulate the X-ray characteristics of the accretion flow around 
a black hole we need to calculate images and spectra from 
the innermost ~ 100 gravitational radii around the black hole. 
In order to capture fine details (such as those introduced by 
rays that graze the photon orbit and ca n affect the detailed 
images and iron-line profiles; see e.g., iJoha nnsen & Psaltis 
2010; Bec kwift&Done1l2005l) we need to resolve the im- 
age plane with a grid spacing of < 0.1 gravitational radii. 
As a result, for a single image, we need to trace at least 10 6 
geodesies. Even at the current best rate of ~ 10 4 geodesic in- 
tegrations per second, a single monoenergetic image at a sin- 
gle instant in time will require ^100 seconds on a fast work- 
station. This is prohibitively slow, if, for example, we aim to 
simulate time-variable emission from a numerical simulation 
or perform large parameter studies of black hole spins, accre- 
tion rates, and observer inclinations, when fitting line profiles 
to data. 

A potential resolution to this bottleneck is calculating a 
large library of geodesies, storing them on the disk, and us- 
ing them with an appropriate interpolation routine either in 
numerical simulations or when fitting data. To estimate the re- 
quirements for this approach, we consider, for the sake of the 
argument, a rather coarse grid on the image plane, spanning 
100 M in each direction, with a resolution of IM. In principle, 
we can refine this grid only for those impact parameters that 
correspond to geodesies that graze the photon orbit. In order, 
e.g., to integrate the radiative transfer equation for each one 
of these 10 4 geodesies that reach the image plane, we need to 
store enough information to reproduce the trajectory without 
recalculating it. Assuming a coarse resolution again, we may 
choose to store ^100 points per geodesic within the inner 100 
gravitational radii. Along each point, we will need to store 
at least three components of the photon 4-momentum (since 
we can always calculate the fourth component by the require- 
ment that the photon traces a null geodesic). For single preci- 
sion storage (i.e., 4 bytes per number), we will need to store 
4 x 3x 100 x 10 4 = 12 MB of information per image. If we 
want now to use a rather coarse grid of ~ 30 values in black 
hole spin and ~ 30 values in the inclination of the observer, 
we need to make use of a 30 x 30 x 12 MB= 1 1 GB database. 
Such a database can only be stored in a hard disk. At an av- 
erage latency time of ~ 1 ms for current disks, the efficiency 
of this approach cannot exceed ~ 10 3 geodesies per second 
(given that a typical disk sector has a size of at most 4KB and 
can handle the data of no more than a few geodesies). This is 
actually comparable to and, in fact, lower than the efficiency 
one would achieve by calculating the geodesies in the first 
place. Note also, that this estimate was performed for a very 
coarse grid. 

The good news is that ray tracing in vacuum is a trivially 
parallelizable algorithm, as individual rays follow indepen- 
dent paths in the spacetime. Our goal in this paper is to present 
a new, massively parallel algorithm that exploits the recent 
advances in state-of-the-art Graphics Processing Unit (GPU) 
platforms designed specifically to handle a large number of 
parallel threads for ray tracing in general computer visualiza- 
tion (see Figure [T). 




Figure 1. A screen shot of GRay in the interactive mode. The image shows 
the deformation of a plane-parallel grid of photons originating at a large dis- 
tance from a spin 0.999 black hole, as they pass the event horizon (white 
sphere near the center of the screen shot). The data always reside on the 
graphics card memory and the visualization is done by CUDA-OpenGL in- 
teroperability (see section|2j- While the geodesic equations are integrated by 
CUDA, the coordinate transformation and particle rendering of the same data 
are both done by the OpenGL shader. 

Our algorithm is based on t he ray tracing approa ch of 
iPsaltis & Johannsenl (120121) and iBaubock et al, (2012}), em- 
ploys nVidia's proprietary Compute Unified Device Ar- 
chitecture (CUDA) framework, and is implemented in 
CUDA C/C++. We briefly describe the implementation and 
list the benchmark results in the next section. As an applica- 
tion, we take advantage of the speed of the code and compute 
the shadows of black holes of different spins at different incli- 
nations in section|4] Finally, we discuss future applications of 
the code such as ray tracing on-the-fly with general relativistic 
magnetohydrodynamic models of accretion flows in section|5] 

2. IMPLEMENTATION, NUMERICAL SCHEME, AND FEATURES 

GPUs were originally developed to handle computation- 
ally intensive graphic applications. They provide hardware 
accelerated rendering in computer graphics, computer-aided 
design, video games, etc. Indeed, modern GPUs are opti- 
mized specifically with ray tracing in mind (albeit for what 
we would call, flat, Euclidean spaces). However, they have re- 
cently found extensive use in scientific computing, known as 
General-Purpose (computing on) Graphics Processing Units 
(GPGPU, see http : / / gpgpu . org), as they provide a low- 
cost, massively parallel platform for computations that do not 
have large memory needs. These two attributes make GPU 
technology optimal for the solution of ray tracing in curved 
spacetimes. 

GPUs achieve their high performance by adopting the 
stream processing paradigm, which is one kind of single in- 
struction, multiple data (SIMD) architectures. There are hun- 
dreds^ of stream processors on a single chip. These stream 

4 For example, there are 16 multiprocessors on nVidia Tesla M2090. Each 
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Figure 2. Numerical difficulties occur when the substeps of a 4th-order Runge-Kutta update is evaluated very close to, or on the two different sides, of the 
poles. In the left panel, the gray circle marks the location of the event horizon for a spin 0.99 black hole. The vertical black line is the coordinate pole. The red 
solid, green dashed, and blue dotted lines are numerical trajectories of the photons with the same initial conditions (see text) but with different treatments of the 
coordinate singularity at the pole. All three trajectories go around the south pole without any apparent problems and wind back to the north pole. However, while 
the red and green trajectories go through the north pole and eventually hit the event horizon, the blue trajectory is kicked back to infinity. The central panel zooms 
in by a factor of 100 to show that the blue trajectory fails to step across the pole. To understand this numerical problem, we overplot all the sub- and full-steps by 
open and filled circles, respectively. The two overlapping open blue circles are evaluated very close to the pole. The right panel zooms further in by another factor 
of 1000. It is now clear that the two open blue circles are located on different sides of the pole. The low-order truncation errors in the 4fh-order Runge-Kutta 
scheme, instead of canceling, are enhanced. The green trajectory avoids this numerical problem by falling back to a lst-order forward Euler scheme, marked 
by the green diamond in the central panel, whenever the geodesic moves across the pole. This lst-order treatment has a larger truncation error because of the 
lst-order stepping — this is visible in the central panel; and it may fail if a full step (filled circle) gets too close to the pole. To reduce the truncation error and 
make the integrator more robust, the red trajectory uses the forward Euler scheme with a smaller time step, which is marked by the red diamond in the central 
panel. The subsequent steps are all shifted to avoid the pole. This final treatment is what we employ in the production scheme. 



processors are designed to perform relatively simple compu- 
tation in parallel. On the other hand, the on-chip support of 
caching (fast memory and their automatic management) and 
branching (conditional code execution, i.e., if-else statements) 
is primitive. The developers are responsible for ensuring effi- 
cient memory access. 

This architecture allows most of the transistors to be de- 
voted to performance arithmetic and yields an impressive 
peak performance. In addition, GPUs hide memory latency by 
fast switching between computing threads — developers are 
encouraged to over-subscribe the physical stream processors 
in order to keep the GPU busy. This is a very different design 
compared to general purpose multi-core Central Processing 
Units (CPUs), which uses multiple instruction, multiple data 
(MIMD) architecture and have intelligent cache management 
and branch predictor to maximize the performance. 

Although Open Computing Language (OpenCL) is the in- 
dustrial open standard of GPGPU programming, we choose 
CUDA C/C++ to implement this publicly available version 
of GRay because of the availability of good textbooks (e.g., 
iKirk & HwvJl20ia ISanders & KandrotlhoiOl) and the easier 
learning curve. In CUDA terminology, the host (i.e., the 
CPU and the main memory) sends a parallel task to the de- 
vice (i.e., the GPU and the graphics card memory) by launch- 
ing a computing kernel. The kernel runs concurrently on the 
device involving many lightweight computing threads. Be- 
cause of hardware limitation, threads are organized in blocks 
(of threads) and grids (of blocks). Threads within a block can 
communicate with each other by using a small amount of fast, 
on-chip shared memory; while threads across different blocks 
can only communicate by accessing a slow, on-card global 
memory. 

Because geodesies do not interact with each other, in GRay, 

multiprocessor is made up by 32 cores. Hence, there are a total of 512 stream 
processors on a single GPU. 



we simply put each geodesic into a CUDA thread. The states 
of the photons are stored as an array of structure, which, un- 
fortunately, is not optimal for the GPU to access. In order to 
maximize the bandwidth, we employ an in-block data trans- 
pose by using the shared memorjQ. We fix the block size, 
i.e., the number of threads within a block, «bi ck> t0 64, which 
is larger than the number of physical stream processors in a 
multiprocessor. This over-subscription keeps the GPU busy 
by allowing a stream processor to work on a thread while 
waiting for the data for another thread to arriveQ. The grid 
size, i.e., the number of blocks within a grid, n„M is com- 
puted by the idiomatic for mula (see, e.g., IKirk & Hwull2010l 
Sanders & Kandrot 2010, or sample codes provided by the 
CUDA software development kit) 

"grid = l(n- l)/«biockJ + 1, (1) 

where n is the total number of photons and |_ • J is the floor 
function. The above formula ensures that n g rid"biock > n so 
there are enough threads to integrate all the photons. 

We emplo y a standard 4th-order Ru nge-Kutta scheme 
presented in IPsaltis & JohannsenI (120121) to integrate equa- 
tions (9)— (12). To avoid the coordinate singularity of the Kerr 
metric at the event horizon rt, n = 1 + V 1 — a 2 , we set the step 
size as 

A y = . f A r-rbh \ 

~ mm V \d\nr/dk'\ + \d0/dl'\ + \d<j) /dX'\ ' 2\dr/dX'\ ) 

(2) 

and stop integrating the photon trajectory at r^h + 8 to avoid it 
crossing the horizon at n*. Both A ~ 1/32 and 8 ~ 10~ 6 are 

5 GRay integrates each geodesic for many steps in a single data load. The 
in-block transpose, therefore, only improves GRay's performance in the inter- 
active mode, where we limit the number of steps to trade for response time. 

6 Because ray tracing in the Kerr spacetime is computationally intensive, 
this over-subscription does not play a crucial role in GRay's performance. 
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user provided parameters. In addition, we use the remapping 
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GRay, we follow iPsaltis & Johannsenl d2012l) to monitor the 
quantity 



(3) 



to enforce 9 to stay in the domain [0,7t]. 

The scheme described above can accurately integrate al- 
most all geodesies. However, it breaks down for some of the 
geodesies that pass through the poles at 9 — or n. To illus- 
trate how the scheme breaks down, we choose the special ini- 
tial conditions ro cos 9q = 1000M, r sin 9q = 4.833605M, and 
00 = for which the photon trajectory passes both the south 
and north poles of a spin 0.99 black holfl In each panel of 
Figure |2 we plot the result of tracing the above ray with blue 
dotted lines. 

In the left panel of Figure [2] the gray circle marks the lo- 
cation of the event horizon for the spin 0.99 black hole. The 
vertical black line is the pole. The green dashed and the red 
solid lines are the numerical trajectories of the photons with 
the same initial conditions but with different treatments of the 
coordinate singularity at the pole, as we will describe below. 
All three trajectories go around the south pole without any 
apparent problem and wind back to the north pole. While the 
red and green trajectories go through the north pole, circulate 
around the black hole a couple times, and eventually hit the 
event horizon, the blue trajectory is kicked back to infinity due 
to a numerical error. 

The central panel of Figure [2] is a lOOx magnification of 
the region where the trajectories intersect with the north pole. 
It shows that the blue trajectory fails to step correctly across 
the pole. To pinpoint this numerical difficulty, we overplot all 
the Runge-Kutta sub- and full-steps by open and filled circles, 
respectively. The two overlapping open blue circles land very 
close to the pole. 

The right panel offers a further lOOOx magnification of the 
same region. It is now clear that the two nearly overlapping 
open blue circles sit actually on the opposite sides of the pole. 
This is a problem for the 4th-order Runge-Kutta scheme, in 
which the solution is assumed to be smooth and can be Taylor 
expanded. In this scheme, the low-order truncation errors are 
normally canceled by a clever combination of the substeps. 
Evaluating the geodesic equation in the different substeps on 
the two sides of the pole, however, introduces an inconsis- 
tency in the scheme and enhances the low-order truncation 
errors. 

The green trajectory in Figure |2] shows the result of an 
improved scheme, which solves the inconsistency by falling 
back to a lst-order forward Euler step whenever a geodesic 
moves across the pole. The low-order step is marked by the 
green diamond in the central panel. This treatment mends 
the numerical difficulty and allows the photon to pass through 
the pole. Unfortunately, the low order stepping results in a 
larger truncation error in the numerical solution. The small 
but visible offset between the green trajectory and the other 
two trajectories in the central panel is indeed caused by the 
low order step at the south pole. Even worse, this treatment 
may fail when a full step (i.e., the filled circles) gets too close 
to the pole. 

To reduce the truncation error of the low order step and 
make the integrator more robust, in the production scheme of 

7 Because of cylindrical symmetry, the value of 0o is not important in this 
setup. Indeed, for <po = 0°, 1°,2°, . . . ,359°, the same pole problem is always 
encountered. 
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which should always remain equal to — 1. If || + 1| > e, for 
some small parameter e ~ 10~ 3 in the numerical scheme, we 
re-integrate the inaccurate step by falling back to the lst-order 
forward Euler scheme with a smaller time step AA'/9. This 
step size is chosen so that (i) the absolute numerical error of 
the solution does not increase substantially because of this 
single low-order step, and (ii) the pole is not encountered even 
if the Euler scheme is continuously applied. This lst-order 
step is marked by the red diamond in the central panel of Fig- 
ure|2] The subsequent steps, as shown in the figure by the red 
circles, are all shifted towards the left and skip the pole. 

We find this final pole treatment extremely robust and use 
it for all our production calculations. For the 3 x 10 8 trajec- 
tories that we will integrate in section |4] none of them fails at 
the pole as long as we fix A = 10~ 6 . The rest of the implemen- 
tation of the algorithm, the initial conditi ons, and the setup of 
the ray s on the image plane proceed as in Psa ltis & Johannsenl 

In addition to performing the computation of ray tracing, 
GRay takes advantage of the programmable graphics pipeline 
to perform real time data visualization. It can be compiled 
in an interactive mode by enabling OpenGL. The OpenGL 
frame buffer is allocated on the graphics card, which is 
then mapped to CUD A for ray tracing. This technique is 
called CUDA-OpenGL interoperability — there is no need 
to transfer the data between the host and the device. Be- 
cause the data reside on the graphics card and are accessible 
to OpenGL, we use the OpenGL Shading Language (GLSL, 
see |http : //www . opengl .or g) to perform coordinate trans- 
formation and sprite drawing. A screen shot of this built-in 
real time visualization is provided in Figure Q] 

3. BENCHMARKS 

The theoretical peak performance of a high-end GPU is 
always about an order magnitude faster than the peak per- 
formance of a high-end multi-core CPU (iKirk & Hwufe OlO: 
ISanders & Kandrol 120101) . However, because of the funda- 
mental difference in the hardware design, their real world per- 
formances depend on the nature of the problem and the im- 
plementation of the algorithms. In order to compare different 
aspects of the implementation of the ray-tracing algorithm, 
we perform two different benchmarks on three codes in this 
section. 

1. Geokerr is a well established, publicly available code 
written in FORTRAN. The code uses a semi-analytical 
approach to solve for null geodesies in Kerr spacetimes, 
which leads to accurate solutions even with arbitrarily 
large time stepfl T he deta i ls of t he algorithm are doc- 
umented in iDexter & A"gol (120091) . 

8 Note, however, that substeps are need if there are turning points in the 
null geodesies. 
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Figure 3. Projections of a uniform Cartesian grid in the image plane to 
the equatorial plane of spin (left column) and 0.95 (right column) black 
holes. The images in the top and bottom rows have inclination angles 0° 
and 60°, respectively. Th e y are plotted in a way to match Figure 2 o f 
ISchnittm an & Bertschingejj C004D and Figure 3 of Dexter & Ago! 120091) ; 
i.e., the horizontal and vertical axes correspond to the — )3o- and — 0!o- 
directions. The configuration in the lower right panel with parameters a = 
0.95 and i = 60° is the representative ray tracing problem we use in Fig.|4]for 
the comparative benchmarks. 
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Figure 4. Results of the grid projection benchmark for the configuration 
shown in the lower-right panel of Figure [5] The run times of three differ- 
ent algorithms, Geokerr (blue diamonds), Ray (green triangles), and GRay 
(red circles) in double precision, are plotted against the number of geodesies 
traced for each image. The asymptotic linear dependence seen for all three al- 
gorithms demonstrates explicitly that the ray tracing problem is highly paral- 
lelizable. For a small number of geodesies, the performance of GRay flattens 
to a constant value (approximately 20 ms for the configuration used) because 
of the time required for launching the CUDA kernel. 



2. Ray is an algorithm that uses a standard 4th-order 
Runge-Kutta scheme to integrate the geodesic equa- 
tions in spacetimes with arbitrary quadrupole moments. 
It is written in C and runs efficiently on CPUs. The 
code has been used to test the no-hair theorem and gen- 
erate profiles and spectra from spinning neutro n stars 
(iPsartis & Johannsenll20iaiBaubock et alJl2Q12h . 

3. GRay, the open source GPU code we describe in this 
paper, is based on Ray's algorithm. It is written in 
CUDA C/C++ and runs efficiently on most nVidia 
GPUs. The source code is published under the GNU 
General Public License Version 3 and is available at 
https : //github . com/ chanchikwan/gr ay . 

For the first benchmark, we compute the projection of a 
uniform Cartesian grid in the image plane onto the equato- 
rial pl ane of a spinning black hole . This problem was carried 
out in Schnittman & Bertschinger (2004) and then used as a 
test case in iDexter & Agoll (12009ft . We reproduce the pub- 
lished results in Figure [3] using GRay and initializing the im- 
age plane at r = 1000M. The left and right columns show 
the projections for two black holes with spins and 0.95, re- 
spectively; in each case, the top and bottom rows represent 
observer inclinations of 0° and 60°, respectively. 

The case shown in the lower-right panel of Figure [3] with 
parameters a = 0.95 and i = 60° is a representative problem, 
which we will use as a benchmark. We use the three algo- 
rithms Geokerr, Ray, and GRay and calculate the projection 
using a grid of n geodesies for each method. We plot the run 
time on a single processor of each calculation as a function of 
the number of geodesies traced in Figure [3] 



We can draw a few interesting conclusions from this sim- 
ple benchmark. The performance of all algorithms scales lin- 
early for almost all problems, signifying the fact that ray trac- 
ing is a highly parallelizable problem. For a small number of 
rays, the performance of GRay flattens at about 20 ms for the 
configuration used, because of the time required for launch- 
ing the CUDA Kernel. This is independent of the number of 
geodesies but, of course, depends on the specific hardware, 
drivers, and operating system used. For a MacBook Pro run- 
ning OS X, this time is of order a few tens of milliseconds, 
while the launching time may be as large as 0.5 seconds for 
some Linux configurations. 

For calculations with a large number of geodesies, which is 
the regime that motivated our work, GRay is faster than both 
Geokerr and Ray by one to two orders of magnitude. It is 
important to emphasize that the performance of GRay exceeds 
that of the other algorithms even in this benchmark that is 
designed in a way that favors the semi-analytical approach of 
Geokerr. This is true because we are only interested in the in- 
tersection of the ray with the equatorial plane, which Geokerr 
can achieve with a very small number of steps per ray. In more 
general radiative transfer problems, however, we have to di- 
vide each ray in small steps in all methods in order to inte- 
grate accurately the radiative transfer equation through black 
hole accretion flows. This requirement puts the Runge-Kutta 
integrators at a larger advantage compared to semi-analytic 
approaches. 

In order to assess the performance of GRay in this second 
situation, we setup a benchmark to measure the average time 
that the integrators require to take a single step in the integra- 
tion of a photon path. We list the results of this benchmark for 
the three algorithms in Table Q] where the numbers have unit 
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Table 1 

Benchmark results of GRay in comparison to other general relativistic ray tracing codes. 



Processor 



Geokerr 11 



Ray 



GRay 



nVidia Tesla M2090" 
nVidia GeForce GT 650M b 
nVidia Tesla M2090 
nVidia GeForce GT 650M 
Intel Core i7-3720QM 2.60GHz c 
Intel Xeon E5520 2.27GHz c 



23000 
43800 



356.67 
692.68 



1.15 
3.27 
7.15 
71.87 



Note. — We focus only at the performance of the geodesic integrators. The numbers listed in the above table have unit of nanosecond per time step per photon. 
Hence, smaller number indicates higher performance. 

11 Geokerr computes the geodesic semi-analytically and hence can take arbitrary long time steps unless there is a turning point in the geodesic. 
b Single precision floating arithmetic is used. 

c Both Geokerr and Ray are serial codes. Hence, only one CPU core is used in these measurements. 



of nanoseconds per timestep per photon, such that a smaller 
number indicates higher performance. In this benchmark, the 
benefit of the GPU integrator becomes clearly visible as GRay 
is 50 times faster than Ray and more than a factor of 1000 
faster than Geokerr. 

4. PROPERTIES OF PHOTON RINGS AROUND KERR BLACK HOLES 

Being a massively parallel algorithm, GRay is an ideal tool 
to study black hole images that involve integrating billions of 
photon trajectories. In general, the details of black hole im- 
ages depend on the time-dependent properties of the turbulent 
accretion flows (see also section[5]for a detailed discussion). 
In all cases, however, for optically thin accretion flows such 
as the one expected around Sgr A* at mm wavelengths, the 
projection of the circular photon orbit produces a bright ring 
on the image p l ane that stands out against the background (see 
Lumined[l979l iBeckwith & Dondl2005l Uohannsen & Psaltis 



spins, as significant asymm etries appear only at a > 0.99. In 
Johannsen & Psaltis (2010), we attributed this to the cancella- 
tion of the ellipsoidal geometry of the Kerr spacetime by the 
frame-dragging effects on the propagation of photons, which 
appears to be exact at the quadrupole order. 

In order to quantif y the magnitude of the ef fects discussed 
above, we follow Johan nsen & Psaltis] ( 120101) to define the 
horizontal displacement of the ring from the geometric cen- 
ter of the spacetime as 



D = 



the average radius of the ring as 



■Oqa 



i r 2ji 



(6) 



(7) 



2010). As pointed out by Uohannsen & Psaltis! d2010l) . the where R = [((Xq -D) 2 + jSq] and 9 = tan _1 (j3 /ao), and 



shape of this so called photon ring that surrounds the black 
hole shadow is a general relativistic effect and is insensitive 
to the complicated astrophysics of the accretion flows. Care- 
ful matching of the theoretical predictions of the photon ring 
with observations, therefore, provides an unmistakable way 
to measu re the black hole mass and even to test the no-h air 
theorem dJohannsen & Ps altis 20 lOl Uohannsen et al.ll2012l) . 

We performed a systematic calculation of the photon rings 
around Kerr black holes of different spins a and observer in- 
clinations i. We choose 16 values of spin according to the 
relation 



the asymmetry parameter as 

1 r2» 

A = 2 



2nJo 



(R-(R)) 2 d$ 



1/2 



(8) 



= 1-10 



-j/5 



where j = 0, 1,..., 15 



(5) 



so that 1 — a is evenly spaced in log scale, and 19 values of 
inclination i = 0,5, 10, . . . ,90. For each configuration, we set 
up the image plane at r = 1000M and define its center at the 
intersection of this plane with a radial line emerging out of the 
black hole. We define (M, ■&) to be the local polar coordinate 
of the image plane. We set up a grid of 6000 x 181 rays in the 
polar domain (1.5,7.5) x [0, n] and integrate them toward the 
black hole. Hence, there are 16 x 19 x 6000 x 181 « 3 x 10 8 
geodesies in this parameter stud>0. 

We plot the outli nes of the photon rings in Figures and 
[6] As discussed in Joha nnsen & Psaltis] (120101) . we find that 
the size of the photon ring depends very weakly on the spin 
of the black hole and the inclination of the observer. More- 
over, the ring retains a highly circular shape at even high 

9 We use git to manage the source code of GRay. For reproducibility, 
the setup of this parameter study is available in the source repository with 
commit id 4e20d4c0. See also the commit message for running the study. 



In the above equations, the coordinates Oo and j8o are under- 
stood to be measured on a two-dimensional Cartesian coordi- 
nate system on the image plane, i.e., they are related to 3fc and 
# by the coordinate transformation 

CCq = @cos&, (9) 
/3 = .5?sin#. (10) 

In Figure [7] we plot these ring quantities as functions of the 
observer inclination i at different black hole spins a. 

In order to facilitate the comparison of theoretical models to 
upcoming observations of black hole shadows with the Event 
Horizon Telescope, we have obtained simple analytic fits to 
the dependence of the average radius and asymmetry of the 
photon ring for black holes with different spins and observer 
inclinations. In particular, we find 

(R) ~2?o + #icos(2.14z'-22.9°) 

with 



(11) 



R = (5.2 -0.209a 
A' i = 0.24 



- 0.445a 2 — 0.567a 3 ) M 



3.3 



and 



(a -0.9017) 2 + 0.059 



A ~ Aq sin" i 



x 10 _i M (12) 



(13) 
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Figure 5. Photon rings around Kerr black holes with different spins a and observer inclinations i. The left, central, and right panels show the photon rings for 
i = 30°, 60°, and 90°, respectively. In each panel, different colors represent different spins — from black being a = to red being a = 0.999. For each inclination, 
the size of the photon ring depends very weakly on the black hole spin. Moreover, the photon ring retains its nearly circular shape even at high black hole spins; 
a significant distortion appears only for a > 0.99 and at large inclination angles. 



a = 0.369043 



= 0.974881 



5 - 



0-i = 




5 - 



- i = 




i' = 90 



( = 90 



a = 0.999 



( = 




z' = 90 
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-5 5 
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-5 
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Figure 6. Photon rings around Ken' black holes with different spins a and observer inclinations ;'. The left, central, and right panels plot the photon rings for 
a = 0.369043, 0.974881, and 0.999, respectively. In each panel, different colors represent different inclinations — going from black for i = 0°, to blue for i = 5°, 
10°, . . . , to red for i = 90°. The photon rings become asymmetric only for a > 0.99 and at large inclination angles. 




Inclination angle i (degree) Inclination angle i (degree) Inclination angle ( (degree) 



Figure 7. Photon ring properties for Kerr black holes with different spins a and different inclinations ;. The left, central, and right panels show the average radius 
(R), the horizontal displacement D, and the asymmetry parameter A of the rings, respectively. In each panel, the horizontal axis is the inclination (' and different 
colors represent different spins — from black for a = to red for a = 0.999. In the leftmost and rightmost panels, the solid curves show the analytic fits discussed 
in the text. 
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Figure 8. Contours of constant photon ring properties around Kerr black 
holes, which can be inferred with imaging observations of their accretion 
flows. The contours of constant average radius of the ring, (R), and asymme- 
try parameter, A, are plotted as solid red and dashed blue curves, respectively. 
If the two contour lines that correspond to an observed photon-ring radius and 
asymmetry for a black hole cross, then both the spin of the black hole and the 
inclination of the observer can be independently inferred. On the other hand, 
if the two contour l ines do not intersect, th is will indicate a violation of the 
no-hair theorem ( Johannsen & Psaltis 2010). 



with 



A = (0.332fl 3 + 0.176fl 2L7 + 0.0756fl' 95 )M 



n = 1.55(1 -a) 



-0.022 



1.3(1 -a) 



0.98 



(14) 



In all relations, the arguments of the trigonometric functions 
are in degrees. The above empirical relations are shown as 
solid curves in the leftmost and rightmost panels of Figure |7] 
In Figure [8] we provide a different representation of the 
above results, by plotting contours of constant average radius 
(R) and asymmetry parameter A on the parameter space of 
black hole spin a and observer incli nation i. The Event Hori- 
zon Telescope dDoeleman et al.ll2009f) aims to perform imag- 
ing observations of the inner accretion flows around the black 
holes in the center of the Milky Way and of M87, in order 
to measure these two parameters of the black hole shadows. 
(The displacement D cannot be readily measured, since there 
is very little indication of the geometric center of the space- 
time that can be obtained from the images). If the two contour 
lines that correspond to the observed radius and asymmetry 
for each black hole cross, then both the spin of the black hole 
and the inclination of the observer can be inferred indepen- 
dently using such observations. On the other hand, if the two 
contour lines do not intersect, this will indicate a violation of 
the no-hair theorem (Joh annsen & Psaltis 20K)|). 

5. DISCUSSIONS 

In this paper, we presented our implementation of the mas- 
sively parallel ray tracing algorithm GRay for GPU archi- 
tecture. We demonstrated that its performance is about two 
orders of magnitude faster than equivalent CPU ray tracing 
codes. Running this algorithm on an nVidia Tesla M2090 



card, we are able to compute a 1024 x 1024 pixel image 
in about a few seconds (see Figure |4). At the same time, 
we can achieve timesteps per photon as small as 1 ns on 
the same GPU card (see Table [TJ. Bearing in mind that 
communication is almost always slower than computation 
in high-performance computing (e.g., the host-device band- 
width, through PCI express, is at least an order of magnitude 
faster than the hard disk bandwidth, through S ATA) also leads 
us to conclude that using GPUs to perform the computation 
of geodesies when needed in an algorithm is a more efficient 
approach to solving this problem compared to tabulating pre- 
computed results in a database. 

Our initial goal is to use GRay to make significant advances 
in modeling and interpreting observational data. Neverthe- 
less, GRay will also be extremely useful in performing three 
dimensional, magnetohydrodynamic (MHD) calculations in 
full general relativity aiming to achieve ab initio simula- 
tions of MHD p rocesses in the vicinity of b l ack hole hori- 
zons ( see, e.g.. IDe Villiers & Hawlevl 20031; iGammie et all 
1 20031; iMizuno et aljj2006t iGiacomazzo & Rezzolla l 120071: 
bel Zanna et alj|2007t ICerda-Dura n et all 120081; IZinkll201 lb . 
Besides being very important for improving of our under- 
standing of accretion flows, MHD simulations have been in- 
strumental in interpret ing observations of Sgr A* and its un- 
usual flares (see. e.g.. | Chanet al1l2009HMo scibrodzk aetal.l 
2009; iDodds-Eden et al.ll201CH;lDolence et alj|2012h . 

Comparing the results of numerical simulations to obser- 
vations requires, at the very least, using the calculated time- 
dependent thermodynamic and hydrodynamic properties of 
the MHD flows to predict lightcurves, spectra, and images. 
At the same time, the propagation of radiation within the 
MHD flow contributes to its heating and cooling. In addi- 
tion, radiation forces determine even the dynamics of near 
Eddington accretion flows. Calculating the propagation of 
radiation within the accretion flow and to an observer at 
infinity in a time-dependent manner is very time consum- 
ing. It has been taken into account only in limited simula- 
tions and under various simplifying assumptions (see, e.g., 
IDe Vil liers 2008). In fact, only a handful of numerical al- 
gorithms have been used to date for calculations of observed 
quantities post facto, based on snapshots of MHD simulations 
dDexter & Agolll2009t iDolence et al.ll2009l) . This "fast light" 
approximation breaks down close to the black hole because 
of the speed of the plasma there is comparable to the speed 
of light. In order to overcome the storage requirement of fre- 
quent data dump, GRay may be integrated into a general rela- 
tivistic MHD code to perform ray tracing on-the-fly. 
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